Abstract. The word position automaton was introduced by Glushkov and McNaughton in the early 1960s. This automaton is homogeneous and has $\|E\| + 1$ states for a word expression of alphabetic width $\|E\|$. This kind of automaton is extended to regular tree expressions.

In this paper, we give an efficient algorithm that computes the Follow sets, which are used in different algorithms for converting a regular expression into a tree automaton. In the following, we consider the k-position tree automaton construction. We prove that for a regular expression $E$ of size $|E|$ and alphabetic width $\|E\|$, the Follow sets can be computed in $O(\|E\| \cdot |E|)$ time.


1 Introduction 

This paper is an extended version of [5]. 

Regular expressions, which are finite representations of potentially infinite languages, are widely used in various application areas such as XML Schema Languages [?], logic and verification, etc. The concept of word regular expressions has been extended to tree regular expressions.

In the case of words, it is well known that each regular expression can be transformed into a non-deterministic finite automaton. Computer scientists have been interested in designing efficient algorithms for the computation of the position automaton. Three well-known algorithms for the computation of this automaton exist. The first makes use of the notion of star normal form [2] of a regular expression. The second is based on a lazy computation technique [3]. The third is built on the so-called ZPC-structure [?]. The complexity of these three algorithms is quadratic with regard to the size of the regular expression.

This study is motivated by the development of a library of functions for 
handling rational kernels [5] in the case of trees. The first problem consists of 
the conversion of a regular expression into a tree automaton. 

Recently, Kuske and Meinecke [7] proposed an algorithm to construct an equation automaton [?] from a regular tree expression $E$ with an $O(R \cdot |E|^2)$ time complexity, where $|E|$ is the size of $E$ and $R$ is the maximal rank appearing in the ranked alphabet. This algorithm is an adaptation to trees of the one given by Champarnaud and Ziadi in the case of words [?]. This generalization is interesting, although the adaptation of the word algorithm to trees is not obvious at all. Indeed, the Champarnaud and Ziadi algorithm for the construction of the set of transitions is based on the computation of a function called "Follow", which is not yet defined on trees. Notice that the star normal form of a regular tree expression $E$ cannot be defined: this notion does not make sense for trees. For these reasons, the definition of the Follow function in the case of trees is given in this paper, and an efficient algorithm for its computation (and hence for the computation of the k-position tree automaton) is proposed.

The paper is organized as follows: Section 2 outlines finite tree automata over ranked alphabets, regular tree expressions, and linearized regular tree expressions. Next, in Section 3, the notions of First and Follow of regular expressions and the k-position automaton are recalled. Then, in Section 4, we present an efficient algorithm which builds the k-position tree automaton with an $O(\|E\| \cdot |E|)$ time complexity. Finally, the different results described in this paper are summarized in the conclusion.

2 Preliminaries 

Let $(\Sigma, r)$ be a ranked alphabet, where $\Sigma$ is a finite set and $r$ represents the rank of $\Sigma$, which is a mapping from $\Sigma$ into $\mathbb{N}$. The set of symbols of rank $n$ is denoted by $\Sigma_n$. The elements of rank 0 are called constants. A tree $t$ over $\Sigma$ is inductively defined as follows: $t = a$, $t = f(t_1, \ldots, t_k)$ where $a$ is any symbol in $\Sigma_0$, $k$ is any integer satisfying $k \ge 1$, $f$ is any symbol in $\Sigma_k$ and $t_1, \ldots, t_k$ are any $k$ trees over $\Sigma$. We denote by $T_\Sigma$ the set of trees over $\Sigma$. A tree language is a subset of $T_\Sigma$. Let $\Sigma_{>} = \Sigma \setminus \Sigma_0$ denote the set of non-constant symbols of the ranked alphabet $\Sigma$. A Finite Tree Automaton (FTA) [4,7] $A$ is a tuple $(Q, \Sigma, Q_T, \Delta)$ where $Q$ is a finite set of states, $Q_T \subseteq Q$ is the set of final states and $\Delta \subseteq \bigcup_{n \ge 0} (Q \times \Sigma_n \times Q^n)$ is the set of transition rules. This set is equivalent to the function $\Delta$ from $Q^n \times \Sigma_n$ to $2^Q$ defined by $(q, f, q_1, \ldots, q_n) \in \Delta \Leftrightarrow q \in \Delta(q_1, \ldots, q_n, f)$. The domain of this function can be extended to $(2^Q)^n \times \Sigma_n$ as follows:

\[ \Delta(Q_1, \ldots, Q_n, f) = \bigcup_{(q_1, \ldots, q_n) \in Q_1 \times \cdots \times Q_n} \Delta(q_1, \ldots, q_n, f). \]

Finally, we denote by $\Delta^*$ the function from $T_\Sigma$ to $2^Q$ defined for any tree in $T_\Sigma$ as follows: $\Delta^*(t) = \Delta(a)$ if $t = a$ with $a \in \Sigma_0$, and $\Delta^*(t) = \Delta(\Delta^*(t_1), \ldots, \Delta^*(t_n), f)$ if $t = f(t_1, \ldots, t_n)$ with $f \in \Sigma_n$ and $t_1, \ldots, t_n \in T_\Sigma$. A tree $t$ is accepted by $A$ if and only if $\Delta^*(t) \cap Q_T \neq \emptyset$.

The language $\mathcal{L}(A)$ recognized by $A$ is the set of trees accepted by $A$, i.e. $\mathcal{L}(A) = \{t \in T_\Sigma \mid \Delta^*(t) \cap Q_T \neq \emptyset\}$.
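The bottom-up evaluation of $\Delta^*$ can be illustrated with a minimal Python sketch. The encoding is an assumption made for illustration only (it does not come from the paper): a tree is a nested tuple `(symbol, child_1, ..., child_n)`, and the transition function is a dictionary from `(symbol, tuple_of_child_states)` to a set of states. The small automaton at the end is hypothetical.

```python
from itertools import product

def run(delta, tree):
    """Return Delta*(tree): the set of states reachable at the root of tree."""
    symbol, children = tree[0], tree[1:]
    child_states = [run(delta, child) for child in children]
    reached = set()
    # Extension of Delta to sets of states: try every combination of child states.
    # For a constant, product() yields the single empty combination ().
    for combo in product(*child_states):
        reached |= delta.get((symbol, combo), set())
    return reached

def accepts(delta, final_states, tree):
    """A tree is accepted iff Delta*(tree) meets the set of final states."""
    return bool(run(delta, tree) & final_states)

# Hypothetical automaton: accepts the trees over {a, f} (f binary) only.
delta = {
    ("a", ()): {"q"},
    ("f", ("q", "q")): {"q"},
}
final_states = {"q"}

t = ("f", ("a",), ("f", ("a",), ("a",)))
print(accepts(delta, final_states, t))        # tree built from f and a only
print(accepts(delta, final_states, ("b",)))   # b has no transition
```

Trying every combination of child states is exponential in the rank in the worst case; the sketch only mirrors the definition and is not meant as an efficient implementation.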

For any integer $n \ge 0$, for any $n$ languages $L_1, \ldots, L_n \subseteq T_\Sigma$, and for any symbol $f \in \Sigma_n$, $f(L_1, \ldots, L_n)$ is the tree language $\{f(t_1, \ldots, t_n) \mid t_i \in L_i\}$. The tree substitution of a constant $c$ in $\Sigma$ by a language $L \subseteq T_\Sigma$ in a tree $t \in T_\Sigma$, denoted by $t\{c \leftarrow L\}$, is the language inductively defined by: $L$ if $t = c$; $\{d\}$ if $t = d$ where $d \in \Sigma_0 \setminus \{c\}$; $f(t_1\{c \leftarrow L\}, \ldots, t_n\{c \leftarrow L\})$ if $t = f(t_1, \ldots, t_n)$ with $f \in \Sigma_n$ and $t_1, \ldots, t_n$ any $n$ trees over $\Sigma$. Let $c$ be a symbol in $\Sigma_0$. The $c$-product $L_1 \cdot_c L_2$ of two languages $L_1, L_2 \subseteq T_\Sigma$ is defined by $L_1 \cdot_c L_2 = \bigcup_{t \in L_1} t\{c \leftarrow L_2\}$. The iterated $c$-product is inductively defined for $L \subseteq T_\Sigma$ by: $L^{0_c} = \{c\}$ and $L^{(n+1)_c} = L^{n_c} \cup L \cdot_c L^{n_c}$. The $c$-closure of $L$ is defined by $L^{*_c} = \bigcup_{n \ge 0} L^{n_c}$.
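For finite languages, the tree substitution and the $c$-product defined above can be sketched directly in Python. The nested-tuple encoding of trees `(symbol, child_1, ..., child_n)` and the function names are assumptions made for illustration; they are not part of the paper.

```python
from itertools import product

def substitute(tree, c, language):
    """Return t{c <- L} as a set of trees, for a finite language L."""
    symbol, children = tree[0], tree[1:]
    if not children:
        # A constant: replaced by L if it is c, kept otherwise.
        return set(language) if symbol == c else {tree}
    # f(t1{c <- L}, ..., tn{c <- L}): each occurrence of c is replaced independently.
    per_child = [substitute(child, c, language) for child in children]
    return {(symbol, *combo) for combo in product(*per_child)}

def c_product(l1, c, l2):
    """Return L1 ._c L2 for finite languages L1 and L2."""
    result = set()
    for t in l1:
        result |= substitute(t, c, l2)
    return result

l1 = {("f", ("c",), ("c",))}
l2 = {("a",), ("b",)}
for t in sorted(c_product(l1, "c", l2)):
    print(t)
```

Because every occurrence of `c` is substituted independently, `f(c, c)` combined with the two-element language `{a, b}` yields four trees, including the mixed ones `f(a, b)` and `f(b, a)`.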
A regular expression over a ranked alphabet $\Sigma$ is inductively defined by $E = 0$, $E \in \Sigma_0$, $E = f(E_1, \ldots, E_n)$, $E = (E_1 + E_2)$, $E = (E_1 \cdot_c E_2)$, $E = (E_1^{*_c})$, where $c \in \Sigma_0$, $n \in \mathbb{N}$, $f \in \Sigma_n$ and $E_1, E_2, \ldots, E_n$ are any $n$ regular expressions over $\Sigma$. Parentheses can be omitted when there is no ambiguity. We write $E_1 = E_2$ if $E_1$ and $E_2$ graphically coincide. We denote by $\mathrm{RegExp}(\Sigma)$ the set of all regular expressions over $\Sigma$. Every regular expression $E$ can be seen as a tree over the ranked alphabet $\Sigma \cup \{+, \cdot_c, *_c \mid c \in \Sigma_0\}$ where $+$ and $\cdot_c$ can be seen as symbols of rank 2 and $*_c$ has rank 1. This tree is the syntax tree $T_E$ of $E$. We denote by $|E|_f$ the number of occurrences of a symbol $f$ in a regular expression $E$. The alphabetic width $\|E\|$ of $E$ is the number of occurrences of symbols of $\Sigma_{>}$ in $E$ ($\|E\| = \sum_{f \in \Sigma_{>}} |E|_f$). The size $|E|$ of $E$ is the size of its syntax tree $T_E$. The language $[\![E]\!]$ denoted by $E$ is inductively defined by $[\![0]\!] = \emptyset$, $[\![c]\!] = \{c\}$, $[\![f(E_1, E_2, \ldots, E_n)]\!] = f([\![E_1]\!], \ldots, [\![E_n]\!])$, $[\![E_1 + E_2]\!] = [\![E_1]\!] \cup [\![E_2]\!]$, $[\![E_1 \cdot_c E_2]\!] = [\![E_1]\!] \cdot_c [\![E_2]\!]$, $[\![E_1^{*_c}]\!] = [\![E_1]\!]^{*_c}$, where $n \in \mathbb{N}$, $E_1, E_2, \ldots, E_n$ are any $n$ regular expressions, $f \in \Sigma_n$ and $c \in \Sigma_0$. It is well known that a tree language is accepted by some tree automaton if and only if it can be denoted by a regular expression [?]. A regular expression $E$ defined over $\Sigma$ is linear if every symbol of rank greater than or equal to 1 appears at most once in $E$. Note that any constant symbol may occur more than once. Let $E$ be a regular expression over $\Sigma$. The linearized regular expression $\overline{E}$ of $E$ is obtained from $E$ by marking differently all symbols of rank greater than or equal to 1 (symbols of $\Sigma_{>}$). The marked symbols form, together with the constants in $\Sigma_0$, a ranked alphabet $\mathrm{Pos}_E(E)$, the symbols of which we call positions. The mapping $h$ is defined from $\mathrm{Pos}_E(E)$ to $\Sigma$ with $h(\mathrm{Pos}_E(E)_m) \subseteq \Sigma_m$ for every $m \in \mathbb{N}$. It associates with a marked symbol $f_j \in \mathrm{Pos}_E(E)_{>}$ the symbol $f \in \Sigma_{>}$, and with a symbol $c \in \Sigma_0$ the symbol $h(c) = c$. We can extend the mapping $h$ naturally to $\mathrm{RegExp}(\mathrm{Pos}_E(E)) \to \mathrm{RegExp}(\Sigma)$ by $h(a) = a$, $h(E_1 + E_2) = h(E_1) + h(E_2)$, $h(E_1 \cdot_c E_2) = h(E_1) \cdot_c h(E_2)$, $h(E_1^{*_c}) = h(E_1)^{*_c}$, $h(f_j(E_1, \ldots, E_n)) = f(h(E_1), \ldots, h(E_n))$, with $n \in \mathbb{N}$, $a \in \Sigma_0$, $f \in \Sigma_n$, $f_j \in \mathrm{Pos}_E(E)_n$ such that $h(f_j) = f$ and $E_1, \ldots, E_n$ any regular expressions over $\mathrm{Pos}_E(E)$.
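Linearization and the mapping $h$ can be illustrated by a short Python sketch. The tuple encoding of expressions, the prefix-order numbering of marked symbols, and the names `linearize` and `h` as Python functions are assumptions made for this illustration; the sketch also assumes that the original symbol names do not end in a digit, so that stripping trailing digits recovers $h(f_j) = f$.

```python
from itertools import count

# Hypothetical tuple encoding of regular tree expressions:
#   ("0",)  ("const", a)  ("sym", f, (E1, ..., En))
#   ("+", E1, E2)  (".", c, E1, E2)  ("*", c, E1)

def linearize(expr, counter=None):
    """Mark every non-constant symbol with a fresh index (prefix order)."""
    counter = counter if counter is not None else count(1)
    tag = expr[0]
    if tag in ("0", "const"):
        return expr                       # constants are left unmarked
    if tag == "sym":
        name = f"{expr[1]}{next(counter)}"
        return ("sym", name, tuple(linearize(e, counter) for e in expr[2]))
    if tag == "+":
        return ("+", linearize(expr[1], counter), linearize(expr[2], counter))
    if tag == ".":
        return (".", expr[1], linearize(expr[2], counter), linearize(expr[3], counter))
    if tag == "*":
        return ("*", expr[1], linearize(expr[2], counter))
    raise ValueError(f"unknown node {tag!r}")

def h(position):
    """Map a marked symbol such as 'f2' back to 'f' (constants are unchanged)."""
    return position.rstrip("0123456789")

a, b = ("const", "a"), ("const", "b")
E = ("+", ("sym", "f", (a,)), ("sym", "f", (("sym", "h", (b,)),)))
print(linearize(E))
```

On `f(a) + f(h(b))` the sketch produces `f1(a) + f2(h3(b))`, in which each non-constant symbol occurs once while the constants keep their names.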


3 The k-Position Tree Automaton

The set of positions associated with $E$ is straightforwardly deduced from the set of symbols associated with $E$. In order to construct a non-deterministic finite automaton (the position tree automaton) associated with the regular expression $E$ that recognizes $[\![E]\!]$, we need to define two sets, the set $\mathrm{First}(E)$ and the set $\mathrm{Follow}(E, f_j, k)$ for a position $f_j \in \mathrm{Pos}_E(E)_{>}$.

In the remainder of this section, $E$ is a regular expression over a ranked alphabet $\Sigma$. The set of symbols in $\Sigma$ that appear in an expression $F$ is denoted by $\Sigma_F$.

In this section, we show how to compute the k-position tree automaton of a regular expression $E$, recognizing $[\![E]\!]$. This is an extension of the well-known position automaton [?] for word regular expressions, where the $k$ represents the fact that any $k$-ary symbol is no longer a state of the automaton, but is exploded into $k$ states. The same method was presented independently by McNaughton and Yamada [5]. Its computation is based on the computation of particular position functions, defined in the following.

In what follows, for any two trees $s$ and $t$, we denote by $s \preccurlyeq t$ the relation "$s$ is a subtree of $t$". Let $t = f(t_1, \ldots, t_n)$ be a tree. We denote by $\mathrm{root}(t)$ the root of $t$, by $k\text{-child}(t)$ the $k$-th child of $f$ in $t$, that is the root of $t_k$ if it exists, and by $\mathrm{Leaves}(t)$ the set of the leaves of $t$, i.e. $\{s \in \Sigma_0 \mid s \preccurlyeq t\}$.

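Let $E$ be a regular expression and $\overline{E}$ its linearized form, $1 \le k \le m$ be two integers, $f$ be a symbol in $\Sigma_m$ and $f_j$ be a position in $\mathrm{Pos}_E(E)_m$ with $h(f_j) = f$.

The set $\mathrm{First}(E)$ is the subset of $\mathrm{Pos}_E(E)$ defined by $\{\mathrm{root}(t) \in \mathrm{Pos}_E(E) \mid t \in [\![\overline{E}]\!]\}$. The set $\mathrm{Follow}(E, f_j, k)$ is the subset of $\mathrm{Pos}_E(E)$ defined by $\{g_i \in \mathrm{Pos}_E(E) \mid \exists t \in [\![\overline{E}]\!], \exists s \preccurlyeq t, \mathrm{root}(s) = f_j, k\text{-child}(s) = g_i\}$. The set $\mathrm{Last}(E)$ is the subset of $\mathrm{Pos}_E(E)_0$ defined by $\mathrm{Last}(E) = \bigcup_{t \in [\![\overline{E}]\!]} \mathrm{Leaves}(t)$.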

Example 1. Let $\Sigma = \Sigma_0 \cup \Sigma_1 \cup \Sigma_2$ be defined by $\Sigma_0 = \{a, b, c\}$, $\Sigma_1 = \{f, h\}$ and $\Sigma_2 = \{g\}$. Let us consider the regular expression $E$ and its linearized form $\overline{E}$ defined by:

\[ E = (f(a)^{*_a} \cdot_a b + h(b))^{*_b} + g(c, a)^{*_c} \cdot_c (f(a)^{*_a} \cdot_a b + h(b))^{*_b}, \]

\[ \overline{E} = (f_1(a)^{*_a} \cdot_a b + h_2(b))^{*_b} + g_3(c, a)^{*_c} \cdot_c (f_4(a)^{*_a} \cdot_a b + h_5(b))^{*_b}. \]

The language denoted by $\overline{E}$ is $[\![\overline{E}]\!] = \{b, f_1(b), f_1(f_1(b)), f_1(h_2(b)), h_2(b), h_2(f_1(b)), h_2(h_2(b)), \ldots, g_3(b, a), g_3(g_3(b, a), a), g_3(f_4(b), a), g_3(h_5(b), a), f_4(f_4(b)), f_4(h_5(b)), h_5(f_4(b)), h_5(h_5(b)), \ldots\}$.

Consequently, $\mathrm{First}(E) = \{b, f_1, h_2, g_3, f_4, h_5\}$ and $\mathrm{Follow}(E, f_1, 1) = \{b, f_1, h_2\}$, $\mathrm{Follow}(E, h_2, 1) = \{b, f_1, h_2\}$, $\mathrm{Follow}(E, g_3, 1) = \{b, g_3, f_4, h_5\}$, $\mathrm{Follow}(E, g_3, 2) = \{a\}$, $\mathrm{Follow}(E, f_4, 1) = \{b, f_4, h_5\}$, $\mathrm{Follow}(E, h_5, 1) = \{b, f_4, h_5\}$.

The two functions First and Follow are sufficient to construct the k-position tree automaton from a regular expression $E$.

Definition 1. [?] Let $E$ be a regular expression, $f$ and $g$ be symbols in $\Sigma$ and $f_j$ and $g_i$ be positions in $\mathrm{Pos}_E(E)$ with $h(f_j) = f$ and $h(g_i) = g$. The k-Position Tree Automaton $\mathcal{P}_E$ is the automaton $(Q, \Sigma, Q_T, \Delta)$ defined by

\[ Q = \{f_j^k \mid f_j \in \mathrm{Pos}_E(E)_m \wedge 1 \le k \le m\} \cup \{\varepsilon^1\} \text{ with } \varepsilon^1 \text{ a new symbol not in } \Sigma, \qquad Q_T = \{\varepsilon^1\}, \]

\[ \Delta = \{(f_j^k, h(g_i), g_i^1, \ldots, g_i^n) \mid g_i \in \mathrm{Follow}(E, f_j, k)\} \cup \{(\varepsilon^1, h(f_j), f_j^1, \ldots, f_j^m) \mid f_j \in \mathrm{First}(E)\}. \]

It has been shown in [?] that the k-position tree automaton of $E$ accepts $[\![E]\!]$, hence the following theorem:

Theorem 1. [?] Let $E$ be a regular expression, then $\mathcal{L}(\mathcal{P}_E) = [\![E]\!]$.


Example 2. The k-Position Automaton $\mathcal{P}_E$ associated with $E$ of Example 1 is given in Figure 1. The set of states is $Q = \{\varepsilon^1, f_1^1, h_2^1, g_3^1, g_3^2, f_4^1, h_5^1\}$. The set of final states is $Q_T = \{\varepsilon^1\}$. The set of transition rules $\Delta$ is

$f(f_1^1) \to \varepsilon^1$, $f(f_1^1) \to f_1^1$, $f(f_1^1) \to h_2^1$,

$h(h_2^1) \to \varepsilon^1$, $h(h_2^1) \to f_1^1$, $h(h_2^1) \to h_2^1$,

$g(g_3^1, g_3^2) \to g_3^1$, $g(g_3^1, g_3^2) \to \varepsilon^1$,

$f(f_4^1) \to \varepsilon^1$, $f(f_4^1) \to g_3^1$, $f(f_4^1) \to f_4^1$, $f(f_4^1) \to h_5^1$,

$h(h_5^1) \to \varepsilon^1$, $h(h_5^1) \to g_3^1$, $h(h_5^1) \to f_4^1$, $h(h_5^1) \to h_5^1$,

$a \to g_3^2$, $b \to \varepsilon^1$, $b \to f_1^1$, $b \to h_2^1$, $b \to g_3^1$, $b \to f_4^1$, $b \to h_5^1$.

Fig. 1. The k-Position Automaton $\mathcal{P}_E$ of $E = (f(a)^{*_a} \cdot_a b + h(b))^{*_b} + g(c, a)^{*_c} \cdot_c (f(a)^{*_a} \cdot_a b + h(b))^{*_b}$.

In the following sections, we will show how we can efficiently compute the function $\mathrm{Follow}(E, f_j, k)$. This algorithm can be used in different constructions such as the equation automaton [?], the k-c-continuation automaton [?] and the Follow automaton [12].
4 Efficient computation of the function Follow


In [?], Champarnaud and Ziadi gave, in the case of words, an algorithm with an $O(\|E\| \cdot |E|^2)$ space and time complexity. They enhanced the algorithm to one with an $O(|E|^2)$ time and space complexity. In [7], Kuske and Meinecke extend the algorithm based on the notion of word partial derivatives [?] to tree partial derivatives in order to compute from a regular expression $E$ a tree automaton recognizing $[\![E]\!]$. Laugerotte et al. proposed an algorithm for the computation of the position tree automaton and the reduced tree automaton in [5]. This is an extended version of [8]. In [?], Mignot et al. gave an efficient algorithm for the computation of the equation automaton using the k-c-continuations.

In this section we describe an algorithm for the computation of the k-position tree automaton based on the computation of the Follow function.

In the following, we inductively replace each regular subexpression $F^{*_c}$ of $E$ by the regular subexpression $(F + c)^{*_c}$. The regular expressions considered thereafter are assumed to have undergone this transformation.

By abuse of notation we write $\mathrm{First}(E)$ for $\mathrm{First}(\overline{E})$ and $\mathrm{Follow}(E, f_j, k)$ for $\mathrm{Follow}(\overline{E}, f_j, k)$. Let us first show that the functions First and Follow can be inductively computed.


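Lemma 1. [?] Let $E$ be a linear regular expression. The set $\mathrm{First}(E)$ can be computed as follows:

\[ \mathrm{First}(0) = \emptyset, \qquad \mathrm{First}(a) = \{a\}, \]

\[ \mathrm{First}(f_j(E_1, \ldots, E_m)) = \{f_j\}, \]

\[ \mathrm{First}(E_1 + E_2) = \mathrm{First}(E_1) \cup \mathrm{First}(E_2), \]

\[ \mathrm{First}(E_1^{*_c}) = \mathrm{First}(E_1), \]

\[ \mathrm{First}(E_1 \cdot_c E_2) = \begin{cases} (\mathrm{First}(E_1) \setminus \{c\}) \cup \mathrm{First}(E_2) & \text{if } c \in [\![E_1]\!], \\ \mathrm{First}(E_1) & \text{otherwise.} \end{cases} \]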
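The inductive computation of First can be sketched in a few lines of Python. Everything about the encoding is an assumption made for illustration (tuple-encoded linearized expressions, the helper names `sym` and `branch`). Two further choices are made explicit in the comments: a star node `("*", c, F)` is read as $(F + c)^{*_c}$, which applies the transformation of this section on the fly, and the test $c \in [\![E_1]\!]$ is replaced by $c \in \mathrm{First}(E_1)$, which is equivalent because the root of the one-node tree $c$ is $c$ itself.

```python
# Hypothetical tuple encoding of linearized expressions (not from the paper):
#   ("0",)  ("const", a)  ("sym", f_j, (E1, ..., Em))
#   ("+", E1, E2)  (".", c, E1, E2)  ("*", c, E1)

def first(e):
    """First(e) following Lemma 1; ("*", c, F) is read as (F + c)^{*c}."""
    tag = e[0]
    if tag == "0":
        return set()
    if tag in ("const", "sym"):
        return {e[1]}
    if tag == "+":
        return first(e[1]) | first(e[2])
    if tag == "*":
        return first(e[2]) | {e[1]}      # implicit (F + c) transformation
    if tag == ".":
        c, left = e[1], first(e[2])
        # c is in [[E1]] exactly when the constant c is in First(E1).
        return (left - {c}) | first(e[3]) if c in left else left
    raise ValueError(f"unknown node {tag!r}")

def sym(name, *args):
    return ("sym", name, args)

a, b, c = ("const", "a"), ("const", "b"), ("const", "c")

def branch(f, h):
    # Encodes (f(a)^{*a} ._a b + h(b))^{*b}
    return ("*", "b", ("+", (".", "a", ("*", "a", sym(f, a)), b), sym(h, b)))

# Linearized expression of Example 1.
E = ("+", branch("f1", "h2"),
     (".", "c", ("*", "c", sym("g3", c, a)), branch("f4", "h5")))
print(sorted(first(E)))
```

On the expression of Example 1 the sketch returns the positions $b, f_1, h_2, g_3, f_4, h_5$, in agreement with the set $\mathrm{First}(E)$ given there.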


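Lemma 2. [?] Let $E$ be a linear regular expression, $1 \le k \le m$ be two integers and $f_j$ be a symbol in $\Sigma_m$.

The set of symbols $\mathrm{Follow}(E, f_j, k)$ can be computed inductively as follows:

\[ \mathrm{Follow}(0, f_j, k) = \mathrm{Follow}(a, f_j, k) = \emptyset, \]

\[ \mathrm{Follow}(g_i(E_1, \ldots, E_m), f_j, k) = \begin{cases} \mathrm{First}(E_k) & \text{if } g_i = f_j, \\ \mathrm{Follow}(E_l, f_j, k) & \text{if } \exists l \mid f_j \in \Sigma_{E_l}, \\ \emptyset & \text{otherwise,} \end{cases} \]

\[ \mathrm{Follow}(E_1 + E_2, f_j, k) = \begin{cases} \mathrm{Follow}(E_1, f_j, k) & \text{if } f_j \in \Sigma_{E_1}, \\ \mathrm{Follow}(E_2, f_j, k) & \text{if } f_j \in \Sigma_{E_2}, \\ \emptyset & \text{otherwise,} \end{cases} \]

\[ \mathrm{Follow}(E_1 \cdot_c E_2, f_j, k) = \begin{cases} (\mathrm{Follow}(E_1, f_j, k) \setminus \{c\}) \cup \mathrm{First}(E_2) & \text{if } f_j \in \Sigma_{E_1} \wedge c \in \mathrm{Follow}(E_1, f_j, k), \\ \mathrm{Follow}(E_1, f_j, k) & \text{if } f_j \in \Sigma_{E_1} \wedge c \notin \mathrm{Follow}(E_1, f_j, k), \\ \mathrm{Follow}(E_2, f_j, k) & \text{if } f_j \in \Sigma_{E_2} \wedge c \in \mathrm{Last}(E_1), \\ \emptyset & \text{otherwise,} \end{cases} \]

\[ \mathrm{Follow}(E_1^{*_c}, f_j, k) = \begin{cases} \mathrm{Follow}(E_1, f_j, k) \cup \mathrm{First}(E_1) & \text{if } c \in \mathrm{Follow}(E_1, f_j, k), \\ \mathrm{Follow}(E_1, f_j, k) & \text{otherwise.} \end{cases} \]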
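The recursion of Lemma 2 can be turned into a direct, non-optimized Python sketch. The tuple encoding and all function names are assumptions made for illustration. As before, a star node `("*", c, F)` is read as $(F + c)^{*_c}$. The inductive computation of Last is not given in the text; the version below is an assumption that is only valid for expressions without the empty-set symbol $0$ in a position where it would make a subexpression denote the empty language.

```python
# Hypothetical tuple encoding (not from the paper):
#   ("0",)  ("const", a)  ("sym", f_j, (E1, ..., Em))
#   ("+", E1, E2)  (".", c, E1, E2)  ("*", c, E1)   # star read as (E1 + c)^{*c}

def first(e):
    tag = e[0]
    if tag == "0":
        return set()
    if tag in ("const", "sym"):
        return {e[1]}
    if tag == "+":
        return first(e[1]) | first(e[2])
    if tag == "*":
        return first(e[2]) | {e[1]}
    c, left = e[1], first(e[2])
    return (left - {c}) | first(e[3]) if c in left else left

def positions(e):
    """Marked non-constant symbols occurring in e (the set Sigma_e)."""
    tag = e[0]
    if tag in ("0", "const"):
        return set()
    if tag == "sym":
        return {e[1]}.union(*(positions(x) for x in e[2]))
    if tag == "+":
        return positions(e[1]) | positions(e[2])
    if tag == "*":
        return positions(e[2])
    return positions(e[2]) | positions(e[3])

def last(e):
    """Leaves of the trees of [[e]]; assumed formula, ignores empty languages."""
    tag = e[0]
    if tag == "0":
        return set()
    if tag == "const":
        return {e[1]}
    if tag == "sym":
        return set().union(*(last(x) for x in e[2]))
    if tag == "+":
        return last(e[1]) | last(e[2])
    if tag == "*":
        return last(e[2]) | {e[1]}
    c, left = e[1], last(e[2])
    return (left - {c}) | last(e[3]) if c in left else left

def follow(e, f, k):
    """Follow(e, f, k) following Lemma 2 (k is 1-based)."""
    tag = e[0]
    if tag in ("0", "const"):
        return set()
    if tag == "sym":
        if e[1] == f:
            return first(e[2][k - 1])
        for child in e[2]:
            if f in positions(child):
                return follow(child, f, k)
        return set()
    if tag == "+":
        for side in (e[1], e[2]):
            if f in positions(side):
                return follow(side, f, k)
        return set()
    if tag == "*":
        c, inner = e[1], follow(e[2], f, k)
        return inner | first(e) if c in inner else inner
    c, e1, e2 = e[1], e[2], e[3]
    if f in positions(e1):
        inner = follow(e1, f, k)
        return (inner - {c}) | first(e2) if c in inner else inner
    if f in positions(e2) and c in last(e1):
        return follow(e2, f, k)
    return set()

def sym(name, *args):
    return ("sym", name, args)

a, b, c = ("const", "a"), ("const", "b"), ("const", "c")

def branch(f, h):
    return ("*", "b", ("+", (".", "a", ("*", "a", sym(f, a)), b), sym(h, b)))

E = ("+", branch("f1", "h2"),
     (".", "c", ("*", "c", sym("g3", c, a)), branch("f4", "h5")))
for pos, k in [("f1", 1), ("h2", 1), ("g3", 1), ("g3", 2), ("f4", 1), ("h5", 1)]:
    print(pos, k, sorted(follow(E, pos, k)))
```

On the expression of Example 1 this reproduces the six Follow sets listed there. The sketch recomputes `positions`, `first` and `last` at every step, so it is far from the complexity bounds discussed later; it is meant only to make the case analysis concrete.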

The main idea of our algorithm is to split the computation of the function First (resp. Follow) into the computation of two subsets $\mathrm{Fr}_0$ (resp. $\mathrm{Fl}_0$) and $\mathrm{Fr}_{>}$ (resp. $\mathrm{Fl}_{>}$), which are respectively the projections of the set First (resp. Follow) onto the positions associated with symbols of rank 0 and of rank greater than 0.

Thus the set $\mathrm{First}(E)$ can be written as follows:

\[ \mathrm{First}(E) = \mathrm{Fr}_0(E) \uplus \mathrm{Fr}_{>}(E). \]

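Proposition 1. Let $E$ be a linear regular expression and $H$ be a subexpression of $E$. The set of symbols $\mathrm{Fr}_0(H)$ is defined as follows:

\[ \mathrm{Fr}_0(0) = \emptyset, \qquad \mathrm{Fr}_0(a) = \{a\}, \qquad \mathrm{Fr}_0(f_j(E_1, \ldots, E_m)) = \emptyset, \]

\[ \mathrm{Fr}_0(E_1 + E_2) = \mathrm{Fr}_0(E_1) \cup \mathrm{Fr}_0(E_2), \]

\[ \mathrm{Fr}_0(E_1 \cdot_c E_2) = \begin{cases} (\mathrm{Fr}_0(E_1) \setminus \{c\}) \cup \mathrm{Fr}_0(E_2) & \text{if } c \in [\![E_1]\!], \\ \mathrm{Fr}_0(E_1) & \text{otherwise,} \end{cases} \]

\[ \mathrm{Fr}_0(E_1^{*_c}) = \mathrm{Fr}_0(E_1). \]

Proof. Let $E$ be a linear regular expression.

1. If $E = 0$ or if $E = f_j(E_1, \ldots, E_m)$, then $\mathrm{Fr}_0(E) = \emptyset$, and for $E = a$, $\mathrm{Fr}_0(E) = \{a\}$.

2. Let us prove this proposition for the case $E = E_1 \cdot_c E_2$. By definition, $\mathrm{Fr}_0(E_1 \cdot_c E_2) = \mathrm{First}(E_1 \cdot_c E_2) \cap \Sigma_0$. Then:

\[ \begin{aligned} \mathrm{Fr}_0(E_1 \cdot_c E_2) &= \mathrm{First}(E_1 \cdot_c E_2) \cap \Sigma_0 \\ &= \begin{cases} ((\mathrm{First}(E_1) \setminus \{c\}) \cup \mathrm{First}(E_2)) \cap \Sigma_0 & \text{if } c \in [\![E_1]\!], \\ \mathrm{First}(E_1) \cap \Sigma_0 & \text{otherwise} \end{cases} \\ &= \begin{cases} (((\mathrm{Fr}_{>}(E_1) \uplus \mathrm{Fr}_0(E_1)) \setminus \{c\}) \cup (\mathrm{Fr}_{>}(E_2) \uplus \mathrm{Fr}_0(E_2))) \cap \Sigma_0 & \text{if } c \in [\![E_1]\!], \\ (\mathrm{Fr}_{>}(E_1) \uplus \mathrm{Fr}_0(E_1)) \cap \Sigma_0 & \text{otherwise} \end{cases} \\ &= \begin{cases} (\mathrm{Fr}_0(E_1) \setminus \{c\}) \cup \mathrm{Fr}_0(E_2) & \text{if } c \in [\![E_1]\!], \\ \mathrm{Fr}_0(E_1) & \text{otherwise.} \end{cases} \end{aligned} \]

□

The following proposition shows that $\mathrm{Fr}_{>}(E)$ can be computed in a similar way to the case of words.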
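Proposition 2. Let $E$ be a linear regular expression and $H$ be a subexpression of $E$. The set of symbols $\mathrm{Fr}_{>}(H)$ is defined as:

\[ \mathrm{Fr}_{>}(a) = \mathrm{Fr}_{>}(0) = \emptyset, \]

\[ \mathrm{Fr}_{>}(f_j(E_1, \ldots, E_m)) = \{f_j\}, \]

\[ \mathrm{Fr}_{>}(E_1 + E_2) = \mathrm{Fr}_{>}(E_1) \uplus \mathrm{Fr}_{>}(E_2), \]

\[ \mathrm{Fr}_{>}(E_1 \cdot_c E_2) = \begin{cases} \mathrm{Fr}_{>}(E_1) \uplus \mathrm{Fr}_{>}(E_2) & \text{if } c \in [\![E_1]\!], \\ \mathrm{Fr}_{>}(E_1) & \text{otherwise,} \end{cases} \]

\[ \mathrm{Fr}_{>}(E_1^{*_c}) = \mathrm{Fr}_{>}(E_1). \]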
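The two projections can be computed by separate recursions, as the following Python sketch illustrates. The tuple encoding and the names `fr0` and `fr_pos` are assumptions made for illustration; a star node `("*", c, F)` is again read as $(F + c)^{*_c}$, which is why `fr0` adds `c` at star nodes, and the test $c \in [\![E_1]\!]$ is evaluated as $c \in \mathrm{Fr}_0(E_1)$.

```python
# Hypothetical tuple encoding (not from the paper):
#   ("0",)  ("const", a)  ("sym", f_j, (E1, ..., Em))
#   ("+", E1, E2)  (".", c, E1, E2)  ("*", c, E1)   # star read as (E1 + c)^{*c}

def fr0(e):
    """Constants of First(e), computed as in Proposition 1."""
    tag = e[0]
    if tag in ("0", "sym"):
        return set()
    if tag == "const":
        return {e[1]}
    if tag == "+":
        return fr0(e[1]) | fr0(e[2])
    if tag == "*":
        return fr0(e[2]) | {e[1]}
    c, left = e[1], fr0(e[2])
    return (left - {c}) | fr0(e[3]) if c in left else left

def fr_pos(e):
    """Non-constant positions of First(e), computed as in Proposition 2."""
    tag = e[0]
    if tag in ("0", "const"):
        return set()
    if tag == "sym":
        return {e[1]}
    if tag == "+":
        return fr_pos(e[1]) | fr_pos(e[2])
    if tag == "*":
        return fr_pos(e[2])
    # c-product: the right operand contributes only if c is in [[E1]].
    if e[1] in fr0(e[2]):
        return fr_pos(e[2]) | fr_pos(e[3])
    return fr_pos(e[2])

def sym(name, *args):
    return ("sym", name, args)

a, b, c = ("const", "a"), ("const", "b"), ("const", "c")

def branch(f, h):
    return ("*", "b", ("+", (".", "a", ("*", "a", sym(f, a)), b), sym(h, b)))

E = ("+", branch("f1", "h2"),
     (".", "c", ("*", "c", sym("g3", c, a)), branch("f4", "h5")))
print(sorted(fr0(E)), sorted(fr_pos(E)))
```

On the expression of Example 1 the two sets are $\{b\}$ and $\{f_1, h_2, g_3, f_4, h_5\}$, whose disjoint union is $\mathrm{First}(E)$. The sketch recomputes `fr0` at each product node; storing the result once per node of the syntax tree, as the algorithm of Section 4.2 does, avoids this repetition.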


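Proof. Let $E$ be a linear regular expression.

1. If $E = 0$ or if $E = a$, then $\mathrm{Fr}_{>}(E) = \emptyset$, and for $E = f_j(E_1, \ldots, E_m)$, $\mathrm{Fr}_{>}(E) = \{f_j\}$.

2. Let us prove this proposition for the case $E = E_1 \cdot_c E_2$. By definition, $\mathrm{Fr}_{>}(E_1 \cdot_c E_2) = \mathrm{First}(E_1 \cdot_c E_2) \cap \Sigma_{>}$. Then:

\[ \begin{aligned} \mathrm{Fr}_{>}(E_1 \cdot_c E_2) &= \mathrm{First}(E_1 \cdot_c E_2) \cap \Sigma_{>} \\ &= \begin{cases} ((\mathrm{First}(E_1) \setminus \{c\}) \cup \mathrm{First}(E_2)) \cap \Sigma_{>} & \text{if } c \in [\![E_1]\!], \\ \mathrm{First}(E_1) \cap \Sigma_{>} & \text{otherwise} \end{cases} \\ &= \begin{cases} (((\mathrm{Fr}_{>}(E_1) \uplus \mathrm{Fr}_0(E_1)) \setminus \{c\}) \cup (\mathrm{Fr}_{>}(E_2) \uplus \mathrm{Fr}_0(E_2))) \cap \Sigma_{>} & \text{if } c \in [\![E_1]\!], \\ (\mathrm{Fr}_{>}(E_1) \uplus \mathrm{Fr}_0(E_1)) \cap \Sigma_{>} & \text{otherwise} \end{cases} \\ &= \begin{cases} (((\mathrm{Fr}_0(E_1) \setminus \{c\}) \uplus \mathrm{Fr}_{>}(E_1)) \cup (\mathrm{Fr}_{>}(E_2) \uplus \mathrm{Fr}_0(E_2))) \cap \Sigma_{>} & \text{if } c \in [\![E_1]\!], \\ (\mathrm{Fr}_{>}(E_1) \uplus \mathrm{Fr}_0(E_1)) \cap \Sigma_{>} & \text{otherwise} \end{cases} \\ &= \begin{cases} \mathrm{Fr}_{>}(E_1) \cup \mathrm{Fr}_{>}(E_2) & \text{if } c \in [\![E_1]\!], \\ \mathrm{Fr}_{>}(E_1) & \text{otherwise.} \end{cases} \end{aligned} \]

□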


Let us recall that $\mathrm{Fl}_0(E, f_j, k)$ and $\mathrm{Fl}_{>}(E, f_j, k)$ are, respectively, the projections of the set $\mathrm{Follow}(E, f_j, k)$ onto the positions associated with symbols of rank 0 and of rank greater than 0. We have:

\[ \mathrm{Follow}(E, f_j, k) = \mathrm{Fl}_0(E, f_j, k) \uplus \mathrm{Fl}_{>}(E, f_j, k). \]


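Proposition 3. Let $E$ be a linear regular expression, $1 \le k \le m$ be two integers and $f_j$ be a symbol in $\Sigma_m$. The function $\mathrm{Fl}_0(E, f_j, k)$ can be computed inductively as follows:

\[ \mathrm{Fl}_0(a, f_j, k) = \mathrm{Fl}_0(0, f_j, k) = \emptyset, \]

\[ \mathrm{Fl}_0(g_i(E_1, \ldots, E_m), f_j, k) = \begin{cases} \mathrm{Fr}_0(E_k) & \text{if } g_i = f_j, \\ \mathrm{Fl}_0(E_l, f_j, k) & \text{if } f_j \in \Sigma_{E_l}, \end{cases} \]

\[ \mathrm{Fl}_0(E_1 + E_2, f_j, k) = \begin{cases} \mathrm{Fl}_0(E_1, f_j, k) & \text{if } f_j \in \Sigma_{E_1}, \\ \mathrm{Fl}_0(E_2, f_j, k) & \text{if } f_j \in \Sigma_{E_2}, \end{cases} \]

\[ \mathrm{Fl}_0(E_1 \cdot_c E_2, f_j, k) = \begin{cases} (\mathrm{Fl}_0(E_1, f_j, k) \setminus \{c\}) \cup \mathrm{Fr}_0(E_2) & \text{if } f_j \in \Sigma_{E_1} \text{ and } c \in \mathrm{Fl}_0(E_1, f_j, k), \\ \mathrm{Fl}_0(E_1, f_j, k) & \text{if } f_j \in \Sigma_{E_1} \text{ and } c \notin \mathrm{Fl}_0(E_1, f_j, k), \\ \mathrm{Fl}_0(E_2, f_j, k) & \text{if } f_j \in \Sigma_{E_2} \text{ and } c \in \mathrm{Last}(E_1), \\ \emptyset & \text{otherwise,} \end{cases} \]

\[ \mathrm{Fl}_0(E_1^{*_c}, f_j, k) = \begin{cases} (\mathrm{Fl}_0(E_1, f_j, k) \setminus \{c\}) \cup \mathrm{Fr}_0(E_1) & \text{if } c \in \mathrm{Fl}_0(E_1, f_j, k), \\ \mathrm{Fl}_0(E_1, f_j, k) & \text{otherwise.} \end{cases} \]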


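Proof. Let $E$ be a linear regular expression, $1 \le k \le m$ be two integers and $f_j$ be a symbol in $\Sigma_m$.

1. If $E = 0$ or if $E = a$, then $\mathrm{Fl}_0(E, f_j, k) = \emptyset$.

Let us prove this proposition for the cases $E = E_1 \cdot_c E_2$ and $E = E_1^{*_c}$.

2. Let us consider that $E = E_1 \cdot_c E_2$. By definition, $\mathrm{Fl}_0(E_1 \cdot_c E_2, f_j, k) = \mathrm{Follow}(E_1 \cdot_c E_2, f_j, k) \cap \Sigma_0$. Then:

\[ \begin{aligned} \mathrm{Fl}_0(E_1 \cdot_c E_2, f_j, k) &= \mathrm{Follow}(E_1 \cdot_c E_2, f_j, k) \cap \Sigma_0 \\ &= \begin{cases} ((\mathrm{Follow}(E_1, f_j, k) \setminus \{c\}) \cup \mathrm{First}(E_2)) \cap \Sigma_0 & \text{if } f_j \in \Sigma_{E_1} \wedge c \in \mathrm{Fl}_0(E_1, f_j, k), \\ \mathrm{Follow}(E_1, f_j, k) \cap \Sigma_0 & \text{if } f_j \in \Sigma_{E_1} \wedge c \notin \mathrm{Fl}_0(E_1, f_j, k), \\ \mathrm{Follow}(E_2, f_j, k) \cap \Sigma_0 & \text{if } f_j \in \Sigma_{E_2} \wedge c \in \mathrm{Last}(E_1), \\ \emptyset & \text{otherwise} \end{cases} \\ &= \begin{cases} (((\mathrm{Fl}_{>}(E_1, f_j, k) \uplus \mathrm{Fl}_0(E_1, f_j, k)) \setminus \{c\}) \cup (\mathrm{Fr}_0(E_2) \uplus \mathrm{Fr}_{>}(E_2))) \cap \Sigma_0 & \text{if } f_j \in \Sigma_{E_1} \wedge c \in \mathrm{Fl}_0(E_1, f_j, k), \\ (\mathrm{Fl}_{>}(E_1, f_j, k) \uplus \mathrm{Fl}_0(E_1, f_j, k)) \cap \Sigma_0 & \text{if } f_j \in \Sigma_{E_1} \wedge c \notin \mathrm{Fl}_0(E_1, f_j, k), \\ (\mathrm{Fl}_{>}(E_2, f_j, k) \uplus \mathrm{Fl}_0(E_2, f_j, k)) \cap \Sigma_0 & \text{if } f_j \in \Sigma_{E_2} \wedge c \in \mathrm{Last}(E_1), \\ \emptyset & \text{otherwise} \end{cases} \\ &= \begin{cases} (\mathrm{Fl}_0(E_1, f_j, k) \setminus \{c\}) \cup \mathrm{Fr}_0(E_2) & \text{if } f_j \in \Sigma_{E_1} \text{ and } c \in \mathrm{Fl}_0(E_1, f_j, k), \\ \mathrm{Fl}_0(E_1, f_j, k) & \text{if } f_j \in \Sigma_{E_1} \text{ and } c \notin \mathrm{Fl}_0(E_1, f_j, k), \\ \mathrm{Fl}_0(E_2, f_j, k) & \text{if } f_j \in \Sigma_{E_2} \text{ and } c \in \mathrm{Last}(E_1), \\ \emptyset & \text{otherwise.} \end{cases} \end{aligned} \]

3. Let us consider that $E = E_1^{*_c}$. By definition we have $\mathrm{Fl}_0(E_1^{*_c}, f_j, k) = \mathrm{Follow}(E_1^{*_c}, f_j, k) \cap \Sigma_0$. Then:

\[ \begin{aligned} \mathrm{Fl}_0(E_1^{*_c}, f_j, k) &= \mathrm{Follow}(E_1^{*_c}, f_j, k) \cap \Sigma_0 \\ &= \begin{cases} ((\mathrm{Follow}(E_1, f_j, k) \setminus \{c\}) \cup \mathrm{First}(E_1)) \cap \Sigma_0 & \text{if } c \in \mathrm{Fl}_0(E_1, f_j, k), \\ \mathrm{Follow}(E_1, f_j, k) \cap \Sigma_0 & \text{otherwise} \end{cases} \\ &= \begin{cases} (((\mathrm{Fl}_{>}(E_1, f_j, k) \uplus \mathrm{Fl}_0(E_1, f_j, k)) \setminus \{c\}) \cup (\mathrm{Fr}_0(E_1) \uplus \mathrm{Fr}_{>}(E_1))) \cap \Sigma_0 & \text{if } c \in \mathrm{Fl}_0(E_1, f_j, k), \\ (\mathrm{Fl}_{>}(E_1, f_j, k) \uplus \mathrm{Fl}_0(E_1, f_j, k)) \cap \Sigma_0 & \text{otherwise} \end{cases} \\ &= \begin{cases} (\mathrm{Fl}_0(E_1, f_j, k) \setminus \{c\}) \cup \mathrm{Fr}_0(E_1) & \text{if } c \in \mathrm{Fl}_0(E_1, f_j, k), \\ \mathrm{Fl}_0(E_1, f_j, k) & \text{otherwise.} \end{cases} \end{aligned} \]

□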


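Proposition 4. Let $E$ be a linear regular expression, $1 \le k \le m$ be two integers and $f_j$ be a symbol in $\Sigma_m$. We define inductively the set $\mathrm{Fl}_{>}(E, f_j, k)$ as follows:

\[ \mathrm{Fl}_{>}(a, f_j, k) = \mathrm{Fl}_{>}(0, f_j, k) = \emptyset, \]

\[ \mathrm{Fl}_{>}(g_i(E_1, \ldots, E_m), f_j, k) = \begin{cases} \mathrm{Fr}_{>}(E_k) & \text{if } g_i = f_j, \\ \mathrm{Fl}_{>}(E_l, f_j, k) & \text{if } f_j \in \Sigma_{E_l}, \end{cases} \]

\[ \mathrm{Fl}_{>}(F + G, f_j, k) = \begin{cases} \mathrm{Fl}_{>}(F, f_j, k) & \text{if } f_j \in \Sigma_F, \\ \mathrm{Fl}_{>}(G, f_j, k) & \text{if } f_j \in \Sigma_G, \end{cases} \]

\[ \mathrm{Fl}_{>}(F \cdot_c G, f_j, k) = \begin{cases} \mathrm{Fl}_{>}(F, f_j, k) \cup \mathrm{Fr}_{>}(G) & \text{if } c \in \mathrm{Fl}_0(F, f_j, k), \\ \mathrm{Fl}_{>}(F, f_j, k) & \text{if } f_j \in \Sigma_F \text{ and } c \notin \mathrm{Fl}_0(F, f_j, k), \\ \mathrm{Fl}_{>}(G, f_j, k) & \text{if } f_j \in \Sigma_G \text{ and } c \in \mathrm{Last}(F), \\ \emptyset & \text{otherwise,} \end{cases} \]

\[ \mathrm{Fl}_{>}(F^{*_c}, f_j, k) = \begin{cases} \mathrm{Fl}_{>}(F, f_j, k) \cup \mathrm{Fr}_{>}(F) & \text{if } c \in \mathrm{Fl}_0(F, f_j, k), \\ \mathrm{Fl}_{>}(F, f_j, k) & \text{otherwise.} \end{cases} \]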


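Proof. Let $E$ be a linear regular expression, $1 \le k \le m$ be two integers and $f_j$ be a symbol in $\Sigma_m$.

1. If $E = 0$ or if $E = a$, then $\mathrm{Fl}_{>}(E, f_j, k) = \emptyset$.

Let us prove this proposition for the cases $E = E_1 \cdot_c E_2$ and $E = E_1^{*_c}$.

2. Let us consider that $E = E_1 \cdot_c E_2$. By definition, $\mathrm{Fl}_{>}(E_1 \cdot_c E_2, f_j, k) = \mathrm{Follow}(E_1 \cdot_c E_2, f_j, k) \cap \Sigma_{>}$. Then:

\[ \begin{aligned} \mathrm{Fl}_{>}(E_1 \cdot_c E_2, f_j, k) &= \mathrm{Follow}(E_1 \cdot_c E_2, f_j, k) \cap \Sigma_{>} \\ &= \begin{cases} ((\mathrm{Follow}(E_1, f_j, k) \setminus \{c\}) \cup \mathrm{First}(E_2)) \cap \Sigma_{>} & \text{if } f_j \in \Sigma_{E_1} \wedge c \in \mathrm{Fl}_0(E_1, f_j, k), \\ \mathrm{Follow}(E_1, f_j, k) \cap \Sigma_{>} & \text{if } f_j \in \Sigma_{E_1} \wedge c \notin \mathrm{Fl}_0(E_1, f_j, k), \\ \mathrm{Follow}(E_2, f_j, k) \cap \Sigma_{>} & \text{if } f_j \in \Sigma_{E_2} \wedge c \in \mathrm{Last}(E_1), \\ \emptyset & \text{otherwise} \end{cases} \\ &= \begin{cases} (((\mathrm{Fl}_{>}(E_1, f_j, k) \uplus \mathrm{Fl}_0(E_1, f_j, k)) \setminus \{c\}) \cup (\mathrm{Fr}_0(E_2) \uplus \mathrm{Fr}_{>}(E_2))) \cap \Sigma_{>} & \text{if } f_j \in \Sigma_{E_1} \wedge c \in \mathrm{Fl}_0(E_1, f_j, k), \\ (\mathrm{Fl}_{>}(E_1, f_j, k) \uplus \mathrm{Fl}_0(E_1, f_j, k)) \cap \Sigma_{>} & \text{if } f_j \in \Sigma_{E_1} \wedge c \notin \mathrm{Fl}_0(E_1, f_j, k), \\ (\mathrm{Fl}_{>}(E_2, f_j, k) \uplus \mathrm{Fl}_0(E_2, f_j, k)) \cap \Sigma_{>} & \text{if } f_j \in \Sigma_{E_2} \wedge c \in \mathrm{Last}(E_1), \\ \emptyset & \text{otherwise} \end{cases} \\ &= \begin{cases} \mathrm{Fl}_{>}(E_1, f_j, k) \cup \mathrm{Fr}_{>}(E_2) & \text{if } f_j \in \Sigma_{E_1} \text{ and } c \in \mathrm{Fl}_0(E_1, f_j, k), \\ \mathrm{Fl}_{>}(E_1, f_j, k) & \text{if } f_j \in \Sigma_{E_1} \text{ and } c \notin \mathrm{Fl}_0(E_1, f_j, k), \\ \mathrm{Fl}_{>}(E_2, f_j, k) & \text{if } f_j \in \Sigma_{E_2} \text{ and } c \in \mathrm{Last}(E_1), \\ \emptyset & \text{otherwise.} \end{cases} \end{aligned} \]

3. Let us consider that $E = E_1^{*_c}$. By definition we have $\mathrm{Fl}_{>}(E_1^{*_c}, f_j, k) = \mathrm{Follow}(E_1^{*_c}, f_j, k) \cap \Sigma_{>}$. Then:

\[ \mathrm{Fl}_{>}(E_1^{*_c}, f_j, k) = \mathrm{Follow}(E_1^{*_c}, f_j, k) \cap \Sigma_{>} \]

□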


Remark 1. The definition of the set $\mathrm{Fl}_{>}(E, f_j, k)$ is identical to that of the function Follow in the case of words [?]: the formulas are the same.

The construction of the k-position tree automaton $\mathcal{P}_E$ from the regular expression, as presented in this article, complies with the properties of the position automaton proposed by Glushkov. It is the generalization of the position automaton from words to trees.

4.1 ZPC-Structure for Follow Computation

In the word case, the construction of the position automaton has been developed in [15,16]. This construction will be extended to trees in the following.

Let $T_E$ be the syntax tree associated with the regular expression $E$.

The set of nodes of $T_E$ is written as $\mathrm{Nodes}(E)$. For a node $\nu$ in $\mathrm{Nodes}(E)$, $\mathrm{sym}(\nu)$, $\mathrm{father}(\nu)$, $\mathrm{son}(\nu)$, $\mathrm{right}(\nu)$ and $\mathrm{left}(\nu)$ denote respectively the symbol, the father, the son, the right son and the left son of the node $\nu$, if they exist.

We denote by $E_\nu$ the subexpression rooted at $\nu$. Conversely, we write $\nu_F$ for the node associated with a subexpression $F$; in particular, $\nu_E$ is the root of $T_E$. Let $\gamma : \mathrm{Nodes}(E) \cup \{\bot\} \to \mathrm{Nodes}(E) \cup \{\bot\}$ be the function defined by:

\[ \gamma(\nu) = \begin{cases} \mathrm{father}(\nu) & \text{if } \mathrm{sym}(\mathrm{father}(\nu)) = *_c \text{ and } \nu \neq \nu_E, \\ \mathrm{right}(\mathrm{father}(\nu)) & \text{if } \mathrm{sym}(\mathrm{father}(\nu)) = \cdot_c, \\ \bot & \text{otherwise,} \end{cases} \]

where $\bot$ is an artificial node such that $\gamma(\bot) = \bot$. The ZPC-Structure is the syntax tree equipped with the $\gamma(\nu)$ links.

We extend the relation $\preccurlyeq$ to the set of nodes of $T_E$: for two nodes $\mu$ and $\nu$ we write $\nu \preccurlyeq \mu \Leftrightarrow T_{E_\nu} \preccurlyeq T_{E_\mu}$. We define the set $\Gamma_\nu(E) = \{\mu \in \mathrm{Nodes}(E) \mid \nu \preccurlyeq \mu \wedge \gamma(\mu) \neq \bot\}$, which is totally ordered by $\preccurlyeq$.

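Proposition 5. Let $E$ be a linear regular expression, $1 \le k \le n$ be two integers and $f$ be in $\Sigma_E \cap \Sigma_n$. Then

\[ \mathrm{Follow}(E, f, k) = \bigl(\cdots\bigl(\bigl(\mathrm{First}(E_{\nu_0}) \cdot_{op(\nu_1)} \mathrm{First}(E_{\gamma(\nu_1)})\bigr) \cdot_{op(\nu_2)} \mathrm{First}(E_{\gamma(\nu_2)})\bigr) \cdots\bigr) \cdot_{op(\nu_m)} \mathrm{First}(E_{\gamma(\nu_m)}) \]

where $\nu_f$ is the node of $T_E$ labelled by $f$, $\nu_0$ is the $k\text{-child}(\nu_f)$, $\Gamma_{\nu_f}(E) = \{\nu_1, \ldots, \nu_m\}$ and, for $1 \le i \le m$, $op(\nu_i) = c$ such that $\mathrm{sym}(\mathrm{father}(\nu_i)) \in \{\cdot_c, *_c\}$.

Proof. By induction over the structure of $E$.

1. Let us suppose that $E = f(E_1, \ldots, E_n)$. Then $\mathrm{Follow}(E, f, k) = \mathrm{First}(E_k)$. Since by definition $\nu_f$ is the root of $T_E$, $k\text{-child}(\nu_f)$ is the root of $E_{\nu_0} = E_k$. Hence $\mathrm{First}(E_{\nu_0}) = \mathrm{First}(E_k) = \mathrm{Follow}(E, f, k)$.

2. Let us suppose that $E = g(E_1, \ldots, E_m)$ with $g \neq f$, or $E = E_1 + E_2$, or $E = E_1 \cdot_c E_2$ with $f \in \Sigma_{E_2}$. Then $\mathrm{Follow}(E, f, k) = \mathrm{Follow}(E_j, f, k)$ with $f \in \Sigma_{E_j}$. By induction hypothesis,

\[ \mathrm{Follow}(E_j, f, k) = \bigl(\cdots\bigl(\bigl(\mathrm{First}(E_{\nu_0}) \cdot_{op(\nu_1)} \mathrm{First}(E_{\gamma(\nu_1)})\bigr) \cdot_{op(\nu_2)} \mathrm{First}(E_{\gamma(\nu_2)})\bigr) \cdots\bigr) \cdot_{op(\nu_m)} \mathrm{First}(E_{\gamma(\nu_m)}) \]

where $\nu_f$ is the node of $T_{E_j}$ labelled by $f$, $\nu_0$ is the $k\text{-child}(\nu_f)$, $\Gamma_{\nu_f}(E_j) = \{\nu_1, \ldots, \nu_m\}$ and, for $1 \le i \le m$, $op(\nu_i) = c$ such that $\mathrm{sym}(\mathrm{father}(\nu_i)) \in \{\cdot_c, *_c\}$. Since $T_{E_j} \preccurlyeq T_E$, $\mathrm{Follow}(E, f, k)$ is given by the same expression, where $\nu_f$ is now the node of $T_E$ labelled by $f$, $\nu_0$ is the $k\text{-child}(\nu_f)$, $\Gamma_{\nu_f}(E_j) = \{\nu_1, \ldots, \nu_m\}$ and, for $1 \le i \le m$, $op(\nu_i) = c$ such that $\mathrm{sym}(\mathrm{father}(\nu_i)) \in \{\cdot_c, *_c\}$.

3. Let us suppose that $E = E_1 \cdot_c E_2$ with $f \in \Sigma_{E_1}$ (resp. $E = E_1^{*_c}$). Then $\mathrm{Follow}(E, f, k) = \mathrm{Follow}(E_1, f, k) \cdot_c \mathrm{First}(G)$ with $G \in \{E_1, E_2\}$. By induction hypothesis,

\[ \mathrm{Follow}(E_1, f, k) = \bigl(\cdots\bigl(\bigl(\mathrm{First}(E_{\nu_0}) \cdot_{op(\nu_1)} \mathrm{First}(E_{\gamma(\nu_1)})\bigr) \cdot_{op(\nu_2)} \mathrm{First}(E_{\gamma(\nu_2)})\bigr) \cdots\bigr) \cdot_{op(\nu_m)} \mathrm{First}(E_{\gamma(\nu_m)}) \]

where $\nu_f$ is the node of $T_{E_1}$ labelled by $f$, $\nu_0$ is the $k\text{-child}(\nu_f)$, $\Gamma_{\nu_f}(E_1) = \{\nu_1, \ldots, \nu_m\}$ and, for $1 \le i \le m$, $op(\nu_i) = c$ such that $\mathrm{sym}(\mathrm{father}(\nu_i)) \in \{\cdot_c, *_c\}$.

Since $T_{E_1} \preccurlyeq T_E$, by setting $H = E_{\gamma(\nu_{m+1})}$ and $op(\nu_{m+1}) = c$,

\[ \mathrm{Follow}(E_1, f, k) \cdot_c \mathrm{First}(H) = \bigl(\cdots\bigl(\bigl(\mathrm{First}(E_{\nu_0}) \cdot_{op(\nu_1)} \mathrm{First}(E_{\gamma(\nu_1)})\bigr) \cdot_{op(\nu_2)} \mathrm{First}(E_{\gamma(\nu_2)})\bigr) \cdots \cdot_{op(\nu_m)} \mathrm{First}(E_{\gamma(\nu_m)})\bigr) \cdot_{op(\nu_{m+1})} \mathrm{First}(E_{\gamma(\nu_{m+1})}) \]

where $\nu_f$ is the node of $T_E$ labelled by $f$, $\nu_0$ is the $k\text{-child}(\nu_f)$, $\Gamma_{\nu_f}(E) = \{\nu_1, \ldots, \nu_m, \nu_{m+1}\}$ and, for $1 \le i \le m + 1$, $op(\nu_i) = c$ such that $\mathrm{sym}(\mathrm{father}(\nu_i)) \in \{\cdot_c, *_c\}$.

□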


4.2 Description of the algorithm and complexity 

An implicit construction of the word position automaton, the so-called ZPC-structure, has been developed by Ziadi et al. [15,16]. Algorithm 1 extends this construction to regular tree expressions. It constructs a forest of trees where every tree rooted at a node $\nu_F$ represents the set $\mathrm{Fr}_{>}(F)$ according to Proposition 2.

Algorithm 1: ZPC-Structure Construction
Input: Regular expression $E$.
Output: ZPC-Structure.

Construct the syntax tree $T_E$ of $E$;
for each node $\nu_F$ in $T_E$ do
    Compute $\mathrm{Fr}_0(F)$;
end for

# The construction of a First Forest
for each node $\nu_{F \cdot_c G}$ in $T_E$ do
    if $c \notin \mathrm{Fr}_0(F)$ then
        Remove the link $(\nu_{F \cdot_c G}, \nu_G)$;
    end if
end for

# We have $\mathrm{First}(f_j(E_1, \ldots, E_n)) = \{f_j\}$
for each node $\nu_{f_j(E_1, \ldots, E_n)}$ in $T_E$ do
    for $i = 1$ to $n$ do
        Remove the link $(\nu_{f_j(E_1, \ldots, E_n)}, \nu_{E_i})$;
    end for
end for

for each node $\nu_F$ with $F \in \Sigma_0$ do
    Delete the node $\nu_F$;
end for

# The construction of Follow links ($\gamma$ links)
for each node $\nu_{F \cdot_c G}$ in $T_E$ do
    Create a follow link from $\nu_F$ to $\nu_G$;
end for

for each node $\nu_{F^{*_c}}$ in $T_E$ do
    Create a link from $\nu_F$ to $\nu_{F^{*_c}}$;
end for

return ZPC-Structure
Example 3. The syntax tree $T_E$ associated with the regular expression $E = (f_1(a)^{*_a} \cdot_a b + h_2(b))^{*_b} + g_3(c, a)^{*_c} \cdot_c (f_4(a)^{*_a} \cdot_a b + h_5(b))^{*_b}$ is given in Figure 2.

Fig. 2. The syntax tree $T_E$ of $E$

The ZPC-Structure associated with $E = (f_1(a)^{*_a} \cdot_a b + h_2(b))^{*_b} + g_3(c, a)^{*_c} \cdot_c (f_4(a)^{*_a} \cdot_a b + h_5(b))^{*_b}$ is given in Figure 3.

Fig. 3. The ZPC-Structure of $E$
Theorem 2. The ZPC-Structure associated with $E$ can be computed in $O(|E|)$ time and space.

Proof. The first step of Algorithm 1 consists of computing the sets $\mathrm{Fr}_0(F)$ for all subexpressions $F$ of $E$. The set $\mathrm{Fr}_0(F)$ is represented by an array whose entries are indexed by the symbols of $\Sigma_0$. The computation of all sets $\mathrm{Fr}_0(F)$ requires $O(|E|)$ time and space.

Now that we have computed the sets $\mathrm{Fr}_0(F)$, the second step consists of the construction of the First Forest. Recall that this First Forest encodes the $\mathrm{Fr}_{>}(F)$ sets for all subexpressions $F$ of $E$. Therefore, the set $\mathrm{Fr}_{>}(F)$ can be obtained by a prefix traversal of the syntax tree of $E$ in $O(|E|)$ time and space. □

As each node $\nu_F$ encodes $\mathrm{Fr}_{>}(E_{\nu_F})$, we can state the following lemma.

Lemma 3. For a subexpression $F$ of $E$, the set $\mathrm{Fr}_{>}(F)$ can be computed in $O(|F|)$ time and space.

For a regular expression $E$, the following algorithm computes the set $\mathrm{Follow}(E, f_j, k)$ for a symbol $f_j \in \Sigma_m$ and an integer $1 \le k \le m$.


Algorithm 2: Computation of the function Follow for $f_j$ and $k$
Input: Regular expression $E$.
Output: $\{\mathrm{Follow}(E, f_j, k) \mid f_j \in \Sigma_m, 1 \le k \le m\}$.

1 Calculate $\mathrm{Follow}(E, f_j, k) = \mathrm{Fl}_0(E, f_j, k) \uplus \mathrm{Fl}_{>}(E, f_j, k)$
(1.1) for $\nu = \nu_{f_j}$ to $\nu_E$ do
          Compute $\mathrm{Fl}_0(E_\nu, f_j, k)$;
      end for
(1.2) Compute $\mathrm{Fl}_{>}(E, f_j, k)$;
return $\mathrm{Fl}_0(E, f_j, k) \uplus \mathrm{Fl}_{>}(E, f_j, k)$

For each step of Algorithm 2 we evaluate the time and space complexity.

We denote by $\sum_{f_j \in \Sigma_{>}} r(f_j)$ the sum of the ranks of all symbols $f_j \in \Sigma_{>}$.

Step 1: Computation of $\mathrm{Follow}(E, f_j, k) = \mathrm{Fl}_0(E, f_j, k) \uplus \mathrm{Fl}_{>}(E, f_j, k)$

We are interested in the computation of the sets $\mathrm{Fl}_0(E, f_j, k)$ and $\mathrm{Fl}_{>}(E, f_j, k)$.

Step 1.1: Computation of the sets $\mathrm{Fl}_0(E_\nu, f_j, k)$

At each node $\nu$ of the syntax tree $T_E$ of $E$, the set $\mathrm{Fl}_0(E_\nu, f_j, k)$ is represented by an array whose entries are indexed by the symbols of $\Sigma_0$. The computation of the set $\mathrm{Fl}_0(E_\nu, f_j, k)$ requires $O(|E|)$ time and space.

Step 1.2: Computation of $\mathrm{Fl}_{>}(E, f_j, k)$

Now that the sets $\mathrm{Fl}_0(E_\nu, f_j, k)$ are computed for all nodes $\nu$ such that $\nu_{f_j} \preccurlyeq \nu$, we can use the techniques outlined in the case of words to calculate the set $\mathrm{Fl}_{>}(E, f_j, k)$. Indeed, the formulas given in Proposition 4 for the computation of $\mathrm{Fl}_{>}(E, f_j, k)$ are the same as those defined in the case of words [2,16], so we can use the same algorithms as in [?] for the computation of the Follow sets. Therefore, the computation of $\mathrm{Fl}_{>}(E, f_j, k)$ can be done in $O(|E|)$ time.

We denote by $R$ the maximal rank of the symbols of $\Sigma$ appearing in $E$. Recall that the alphabetic width $\|E\|$ of a regular expression $E$ is the number of occurrences of symbols of rank greater than 0 appearing in $E$, that is $\|E\| = \sum_{f \in \Sigma_{>}} |E|_f$.

The size of the ranked alphabet $\Sigma$ is considered as constant.

Lemma 4. Let $E$ be a regular expression, $f_j$ be a symbol in $\Sigma_m$ and $1 \le k \le m$ be two integers. The sets $\mathrm{Follow}(E, f_j, k)$ for $1 \le k \le m$ can be computed in time $O(r(f_j) \cdot |E|)$.

As $\sum_{f_j} r(f_j)$ is bounded by $R \cdot \|E\|$, we can state the following theorem.

Theorem 3. The sets $\mathrm{Follow}(E, f_j, k)$ for all symbols $f_j$ in $\Sigma_m$ and for all $1 \le k \le r(f_j)$ can be computed with an $O(R \cdot \|E\| \cdot |E|)$ time complexity.
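The quantities that appear in this bound can be computed with a small Python sketch; the tuple encoding of expressions and the function names are assumptions made for illustration. The sketch measures $|E|$ (nodes of the syntax tree), $\|E\|$ (occurrences of non-constant symbols), $R$ (maximal rank) and the sum of the ranks of all positions, and checks that this sum is at most $R \cdot \|E\|$.

```python
# Hypothetical tuple encoding (not from the paper):
#   ("0",)  ("const", a)  ("sym", f_j, (E1, ..., Em))
#   ("+", E1, E2)  (".", c, E1, E2)  ("*", c, E1)

def children(e):
    tag = e[0]
    if tag in ("0", "const"):
        return ()
    if tag == "sym":
        return e[2]
    if tag == "+":
        return (e[1], e[2])
    if tag == ".":
        return (e[2], e[3])
    return (e[2],)                      # star node

def rank(e):
    """Rank of the symbol at the root of e if it is non-constant, else 0."""
    return len(e[2]) if e[0] == "sym" else 0

def size(e):
    return 1 + sum(size(x) for x in children(e))

def width(e):
    return (1 if e[0] == "sym" else 0) + sum(width(x) for x in children(e))

def rank_sum(e):
    return rank(e) + sum(rank_sum(x) for x in children(e))

def max_rank(e):
    return max([rank(e)] + [max_rank(x) for x in children(e)])

def sym(name, *args):
    return ("sym", name, args)

a, b, c = ("const", "a"), ("const", "b"), ("const", "c")

def branch(f, h):
    return ("*", "b", ("+", (".", "a", ("*", "a", sym(f, a)), b), sym(h, b)))

# Linearized expression of Example 1 (before the (F + c) transformation).
E = ("+", branch("f1", "h2"),
     (".", "c", ("*", "c", sym("g3", c, a)), branch("f4", "h5")))
print(size(E), width(E), max_rank(E), rank_sum(E))
```

For the expression of Example 1 the positions $f_1, h_2, f_4, h_5$ have rank 1 and $g_3$ has rank 2, so the sum of ranks is 6, below $R \cdot \|E\| = 2 \cdot 5$. Summing the cost $O(r(f_j) \cdot |E|)$ of Lemma 4 over all positions is what yields the bound of Theorem 3.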


4.3 Improving the computation of the function Follow 

In this section we present a simple transformation of the regular expression
$E$ which allows us to compute the Follow sets efficiently. For a subexpression
$f_j(E_1, \ldots, E_m)$ of $E$ and a symbol $a$ in $\bigcup_{i=1}^{m} \mathrm{Fr}_0(E_i)$, we associate an expression $E^a_{f_j}$
obtained from $E$ by replacing the subexpression $f_j(E_1, \ldots, E_m)$ by the expression $f_j(a)$.

Example. For the regular expression $E = f(a + g(b),\, a + b + h(a))^{*_a} \cdot_b l(b)$, we
get $E^a_f = f(a)^{*_a} \cdot_b l(b)$ and $E^b_f = f(b)^{*_a} \cdot_b l(b)$.
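
The transformation can be illustrated with a minimal Python sketch. The nested-tuple encoding of expressions and the names `sym` and `replace_by_unary` are hypothetical choices made here for illustration; they do not come from the paper. The target subexpression is matched by identity, which plays the role of the position index $j$.

```python
# Hypothetical encoding (ours, not the paper's): an expression is a nested tuple.
#   ("sym", name, children)   symbol applied to a tuple of subexpressions
#   ("sum", left, right)      E + F
#   ("prod", c, left, right)  c-product  E ._c F
#   ("star", c, body)         c-star     E^{*_c}

def sym(name, *children):
    """Build a symbol node; a constant is a symbol with no children."""
    return ("sym", name, tuple(children))

def replace_by_unary(expr, target, a):
    """Copy `expr`, replacing the node `target` (matched by identity, i.e. by
    position in the syntax tree) with f(a), where f is the label of `target`."""
    if expr is target:
        return sym(target[1], sym(a))
    kind = expr[0]
    if kind == "sym":
        return ("sym", expr[1],
                tuple(replace_by_unary(c, target, a) for c in expr[2]))
    if kind == "sum":
        return ("sum",
                replace_by_unary(expr[1], target, a),
                replace_by_unary(expr[2], target, a))
    if kind == "prod":
        return ("prod", expr[1],
                replace_by_unary(expr[2], target, a),
                replace_by_unary(expr[3], target, a))
    if kind == "star":
        return ("star", expr[1], replace_by_unary(expr[2], target, a))
    raise ValueError(f"unknown node kind: {kind!r}")

# E = f(a + g(b), a + b + h(a))^{*_a} ._b l(b)
f_node = sym("f",
             ("sum", sym("a"), sym("g", sym("b"))),
             ("sum", ("sum", sym("a"), sym("b")), sym("h", sym("a"))))
E = ("prod", "b", ("star", "a", f_node), sym("l", sym("b")))

E_a = replace_by_unary(E, f_node, "a")  # encodes f(a)^{*_a} ._b l(b)
E_b = replace_by_unary(E, f_node, "b")  # encodes f(b)^{*_a} ._b l(b)
```

The function builds a new tree and leaves `E` untouched, so the same original expression can be reused for every symbol $a$.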

For every subexpression $f_j(E_1, \ldots, E_m)$ of $E$ and every symbol $a \in \bigcup_{i=1}^{m} \mathrm{Fr}_0(E_i)$, the
following proposition gives the link between $\mathrm{Follow}(E, f_j, k)$ and $\mathrm{Follow}(E^a_{f_j}, f_j, 1)$.

Proposition 6. Let $E$ be a regular expression, $f_j(E_1, \ldots, E_m)$ be a subexpression
of $E$ and $1 \le k \le m$ be two integers.

The set $\mathrm{Follow}(E, f_j, k)$ can be computed as follows:

$$\mathrm{Follow}(E, f_j, k) = \mathrm{Fr}_{>}(E_k) \uplus \bigcup_{a \in \mathrm{Fr}_0(E_k)} \mathrm{Follow}(E^a_{f_j}, f_j, 1)$$
Proof. For a subexpression $f_j(E_1, \ldots, E_n)$ of $E$, by Proposition 5 the
set $\mathrm{Follow}(E, f_j, k)$ is of the form

$$\mathrm{Follow}(E, f_j, k) = (\cdots((\mathrm{First}(E_{\nu_0}) \cdot_{op(\nu_1)} \mathrm{First}(E_{\gamma(\nu_1)})) \cdot_{op(\nu_2)} \mathrm{First}(E_{\gamma(\nu_2)})) \cdots \cdot_{op(\nu_m)} \mathrm{First}(E_{\gamma(\nu_m)}))$$

where $\nu_{f_j}$ is the node of $T_E$ labelled by $f_j$, $\nu_0$ is the $k$-child($\nu_{f_j}$), $\Gamma_{\nu_{f_j}}(E) = \{\nu_1, \ldots, \nu_m\}$ and, for
$1 \le i \le m$, $op(\nu_i) = c$ such that $\mathrm{sym}(\mathrm{father}(\nu_i)) \in \{\cdot_c, {}^{*_c}\}$.

With $E_{\nu_0} = E_k$ and $\mathrm{First}(E_{\nu_0}) = \mathrm{Fr}_{>}(E_{\nu_0}) \uplus \mathrm{Fr}_0(E_{\nu_0})$, we obtain

$$\begin{aligned}
\mathrm{Follow}(E, f_j, k)
 &= (\cdots((\mathrm{Fr}_{>}(E_{\nu_0}) \uplus \mathrm{Fr}_0(E_{\nu_0})) \cdot_{op(\nu_1)} \mathrm{First}(E_{\gamma(\nu_1)})) \cdots \cdot_{op(\nu_m)} \mathrm{First}(E_{\gamma(\nu_m)})) \\
 &= \mathrm{Fr}_{>}(E_{\nu_0}) \uplus (\cdots(\mathrm{Fr}_0(E_{\nu_0}) \cdot_{op(\nu_1)} \mathrm{First}(E_{\gamma(\nu_1)})) \cdots \cdot_{op(\nu_m)} \mathrm{First}(E_{\gamma(\nu_m)})) \\
 &= \mathrm{Fr}_{>}(E_{\nu_0}) \uplus (\cdots((\bigcup_{a \in \mathrm{Fr}_0(E_{\nu_0})} \{a\}) \cdot_{op(\nu_1)} \mathrm{First}(E_{\gamma(\nu_1)})) \cdots \cdot_{op(\nu_m)} \mathrm{First}(E_{\gamma(\nu_m)}))
\end{aligned}$$


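We now use this last formula together with the following modification: to each symbol $a \in \mathrm{Fr}_0(E_{\nu_0})$
we associate the expression $E^a_{f_j}$ obtained from $E$ by replacing the subexpression
$f_j(E_1, \ldots, E_n)$ by the expression $f_j(a)$. Then, for $a \in \mathrm{Fr}_0(E_{\nu_0})$, we have:

$$\mathrm{Follow}(E^a_{f_j}, f_j, 1) = (\cdots(\{a\} \cdot_{op(\nu_1)} \mathrm{First}(E_{\gamma(\nu_1)})) \cdots \cdot_{op(\nu_m)} \mathrm{First}(E_{\gamma(\nu_m)}))$$

Therefore, taking the union over all symbols $a \in \mathrm{Fr}_0(E_{\nu_0})$:

$$\mathrm{Follow}(E, f_j, k) = \mathrm{Fr}_{>}(E_{\nu_0}) \uplus \bigcup_{a \in \mathrm{Fr}_0(E_{\nu_0})} \mathrm{Follow}(E^a_{f_j}, f_j, 1)$$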


□ 


As the rank of the symbol $f_j$ in $E^a_{f_j}$ is 1, by Lemma 4 the set $\mathrm{Follow}(E^a_{f_j}, f_j, 1)$
can be computed in time $O(|E|)$. This step is considered as a preprocessing and
is common to each symbol $a$ in $\bigcup_{k=1}^{n} \mathrm{Fr}_0(E_k)$, for all $1 \le k \le n$.

So one can first compute the sets $\mathrm{Follow}(E^a_{f_j}, f_j, 1)$ for all $a$ in
$\bigcup_{k=1}^{n} \mathrm{Fr}_0(E_k)$ in $O(|E|)$ time. Second, from these sets and
the set $\mathrm{Fr}_{>}(E_k)$, we construct the set $\mathrm{Follow}(E, f_j, k)$ using the formula of
Proposition 6. This second step can be performed in $O(|E_k| + \|E\|)$ time.
Indeed, by an earlier lemma, $\mathrm{Fr}_0(E_k)$ can be computed in time $O(|E_k|)$, and the set
$\bigcup_{a \in \mathrm{Fr}_0(E_k)} \mathrm{Follow}(E^a_{f_j}, f_j, 1)$ can be constructed from the sets computed in the
first step in $O(\|E\|)$ time.
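
The second step amounts to a union of precomputed sets. The sketch below is a hypothetical illustration: the function name `assemble_follow` and the dictionary layout are our own, and the input sets were worked out by hand for the expression $E = f(a + g(b),\, a + b + h(a))^{*_a} \cdot_b l(b)$ under the usual reading of the $c$-product and $c$-star. They are not values given in the paper and should be treated as illustrative.

```python
def assemble_follow(fr_pos_k, fr0_k, follow_unary):
    """Return Fr_>(E_k) united with Follow(E^a_{f_j}, f_j, 1) for each a in Fr_0(E_k).

    fr_pos_k     -- the set Fr_>(E_k) (symbols of positive rank)
    fr0_k        -- the set Fr_0(E_k) (constant symbols)
    follow_unary -- dict mapping each constant a to the precomputed
                    set Follow(E^a_{f_j}, f_j, 1) (first step)
    """
    result = set(fr_pos_k)
    for a in fr0_k:
        result |= follow_unary[a]
    return result

# First step (preprocessing), hand-derived illustrative values:
#   E^a_f = f(a)^{*_a} ._b l(b)  and  E^b_f = f(b)^{*_a} ._b l(b)
follow_unary = {"a": {"a", "f"}, "b": {"l"}}

# Second step, once per child k of f:
#   E_1 = a + g(b)      with Fr_>(E_1) = {g}, Fr_0(E_1) = {a}
#   E_2 = a + b + h(a)  with Fr_>(E_2) = {h}, Fr_0(E_2) = {a, b}
follow_f_1 = assemble_follow({"g"}, {"a"}, follow_unary)
follow_f_2 = assemble_follow({"h"}, {"a", "b"}, follow_unary)
```

The dictionary `follow_unary` is built once and reused for every $k$, which mirrors the remark that the first step is shared by all children of $f_j$.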
As $\sum_{k=1}^{n} |E_k| \le |E|$ and $r(f_j) \cdot \|E\| \le |E|$, and as the first step is performed
once for all $k$, $1 \le k \le n$, and for all $a \in \bigcup_{k=1}^{n} \mathrm{Fr}_0(E_k)$, we can state the
following proposition.


Proposition 7. Let $E$ be a regular expression and $f_j$ be a symbol in $\Sigma_E$. The
sets $\mathrm{Follow}(E, f_j, k)$ for all $1 \le k \le r(f_j)$ can be computed in $O(|E|)$ time.


Finally we can state the following theorem. 


Theorem 4. Let $E$ be a regular expression. The computation of the Follow sets
for all symbols $f_j \in \Sigma_E$ and $1 \le k \le r(f_j)$ can be done in $O(|E| \cdot \|E\|)$
time.
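
The bound of Theorem 4 can be read off from Proposition 7 by summing over the symbols $f_j$. As a sketch, assuming that there is one indexed symbol $f_j$ per occurrence counted by $\|E\|$ (our reading, not stated explicitly here):

```latex
\[
\sum_{f_j} O(|E|) \;=\; O\bigl(\|E\| \cdot |E|\bigr),
\qquad \text{to be compared with } O\bigl(R \cdot \|E\| \cdot |E|\bigr) \text{ in Theorem 3.}
\]
```

Under this reading, the improvement over Theorem 3 is the removal of the factor $R$.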


Our algorithm for the computation of the Follow sets can be used for the
computation of the set of transition rules of the $k$-position automaton, the equation
automaton, the $k$-C-continuation automaton and the Follow automaton [12].

Remark 2. By analogy with the word case, we have chosen not to count the
constant symbols ($\Sigma_0$) in the alphabetic width of $E$. For example, for the regular
expression $E = f(\underbrace{a, \ldots, a}_{n\text{-times}})$, we have $\|E\| = 1$. However, in [7], the alphabetic width is
the number of occurrences of symbols of $\Sigma$ in $E$, that is $\|E\| = n + 1$.
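
The two conventions can be compared on the expression of Remark 2 with a small Python sketch. The nested-tuple encoding and the names `sym` and `count_symbols` are hypothetical and chosen here for illustration only; the function counts symbol nodes and ignores operator nodes.

```python
# Hypothetical encoding (ours, not the paper's), as nested tuples:
#   ("sym", name, children), ("sum", l, r), ("prod", c, l, r), ("star", c, body)

def sym(name, *children):
    """Build a symbol node; a constant is a symbol with no children."""
    return ("sym", name, tuple(children))

def count_symbols(expr, include_constants):
    """Count occurrences of symbols in `expr`.

    With include_constants=False only symbols of rank > 0 are counted
    (the convention of this paper); with True every symbol is counted
    (the convention attributed to [7] in Remark 2).
    """
    kind = expr[0]
    if kind == "sym":
        own = 1 if (expr[2] or include_constants) else 0
        return own + sum(count_symbols(c, include_constants) for c in expr[2])
    if kind == "sum":
        return (count_symbols(expr[1], include_constants)
                + count_symbols(expr[2], include_constants))
    if kind == "prod":
        return (count_symbols(expr[2], include_constants)
                + count_symbols(expr[3], include_constants))
    if kind == "star":
        return count_symbols(expr[2], include_constants)
    raise ValueError(f"unknown node kind: {kind!r}")

n = 3
E = sym("f", *(sym("a") for _ in range(n)))  # f(a, a, a)

width_without_constants = count_symbols(E, include_constants=False)
width_with_constants = count_symbols(E, include_constants=True)
```

For $n = 3$ the first count is 1 and the second is $n + 1 = 4$, matching the two values given in the remark.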


5 Conclusion 

In this paper, the notion of $k$-position tree automaton associated with a regular
tree expression has been recalled. This automaton generalizes, from
words to trees, the position automaton introduced by Glushkov. We give an
efficient algorithm that computes the Follow function from a regular expression
$E$ in $O(\|E\| \cdot |E|)$ time.

This algorithm for the computation of the Follow sets can be used for the
computation of the set of transitions of the $k$-position, equation, $k$-C-continuation
and Follow automata.